{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyzing the GIFGIF dataset\n",
    "\n",
    "[GIFGIF](http://www.gif.gf/) is a project from the MIT media lab that aims at understanding the emotional content of animated GIF images.\n",
    "The project covers 17 emotions, including *happiness*, *fear*, *amusement*, *shame*, etc.\n",
    "To collect feedback from users, the web site shows two images at a time, and asks feedback as follows.\n",
    "\n",
    "> Which of the left or right image better expresses `[emotion]` ?\n",
    "\n",
    "where `[emotion]` is one of 17 different possibilities.\n",
    "Therefore, the raw data that is collected consists of outcomes of pairwise comparison between pairs of images.\n",
    "*Just the kind of data that `choix` is built for*!\n",
    "\n",
    "In this notebook, we will **use `choix` to try making sense of the raw pairwise-comparison data**.\n",
    "In particular, we would like to embed the images on a scale (for a given emotion).\n",
    "\n",
    "### Dataset\n",
    "\n",
    "We will use a dump of the raw data available at <http://lucas.maystre.ch/gifgif-data>.\n",
    "Download and uncompress the dataset (you don't need to download the images)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import choix\n",
    "import collections\n",
    "import numpy as np\n",
    "\n",
    "from IPython.display import Image, display"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Change this with the path to the data on your computer.\n",
    "PATH_TO_DATA = \"/tmp/gifgif/gifgif-dataset-20150121-v1.csv\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also define a short utility function to display an image based on its identifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/k39w535jFPYrK/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def show_gif(idx):\n",
    "    template = \"http://media.giphy.com/media/{idx}/giphy.gif\"\n",
    "    display(Image(url=template.format(idx=idx)))\n",
    "    \n",
    "# A random image.\n",
    "show_gif(\"k39w535jFPYrK\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processing the raw data\n",
    "\n",
    "First, we need to transform the raw dataset into a format that `choix` can process.\n",
    "Remember that `choix` encodes pairwise-comparison outcomes as tuples `(i, j)` (meaning \"$i$ won over $j$\"), and that items are assumed to be numbered by consecutive integers.\n",
    "\n",
    "We begin by mapping all distinct images that appear in the dataset to consecutive integers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of distinct images: 6,170\n"
     ]
    }
   ],
   "source": [
    "# First pass over the data to transform GIFGIF IDs to consecutive integers.\n",
    "image_ids = set()\n",
    "with open(PATH_TO_DATA) as f:\n",
    "    next(f)  # First line is header.\n",
    "    for line in f:\n",
    "        emotion, left, right, choice = line.strip().split(\",\")\n",
    "        if len(left) > 0 and len(right) > 0:\n",
    "            # `if` condition eliminates corrupted data.\n",
    "            image_ids.add(left)\n",
    "            image_ids.add(right)\n",
    "int_to_idx = dict(enumerate(image_ids))\n",
    "idx_to_int = dict((v, k) for k, v in int_to_idx.items())\n",
    "\n",
    "n_items = len(idx_to_int)\n",
    "print(\"Number of distinct images: {:,}\".format(n_items))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we parse the comparisons in the data and convert the image IDs to the corresponding integers.\n",
    "We collect all the comparisons and filter them by emotion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of comparisons for each emotion\n",
      "pleasure        87,941\n",
      "contentment     71,493\n",
      "embarrassment   49,579\n",
      "surprise        65,030\n",
      "sadness         64,631\n",
      "excitement      82,054\n",
      "happiness      106,877\n",
      "pride           50,394\n",
      "disgust         60,361\n",
      "shame           47,099\n",
      "amusement       76,653\n",
      "anger           65,880\n",
      "relief          39,609\n",
      "contempt        50,976\n",
      "guilt           44,971\n",
      "satisfaction    79,612\n",
      "fear            51,703\n"
     ]
    }
   ],
   "source": [
    "data = collections.defaultdict(list)\n",
    "with open(PATH_TO_DATA) as f:\n",
    "    next(f)  # First line is header.\n",
    "    for line in f:\n",
    "        emotion, left, right, choice = line.strip().split(\",\")\n",
    "        if len(left) == 0 or len(right) == 0:\n",
    "            # Datum is corrupted, continue.\n",
    "            continue\n",
    "        # Map ids to integers.\n",
    "        left = idx_to_int[left]\n",
    "        right = idx_to_int[right]\n",
    "        if choice == \"left\":\n",
    "            # Left image won the comparison.\n",
    "            data[emotion].append((left, right))\n",
    "        if choice == \"right\":\n",
    "            # Right image won the comparison.\n",
    "            data[emotion].append((right, left))\n",
    "            \n",
    "print(\"Number of comparisons for each emotion\")\n",
    "for emotion, comps in data.items():\n",
    "    print(\"{: <14} {: >7,}\".format(emotion, len(comps)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parameter inference\n",
    "\n",
    "Now, we are ready to fit a [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) to the data, in order to be able to embed the images on a quantitative scale (for a given emotion).\n",
    "In the following, we consider *happiness*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(5081, 4390), (3685, 3717), (2774, 3672)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What does the data look like?\n",
    "data[\"happiness\"][:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 3min, sys: 3.32 s, total: 3min 3s\n",
      "Wall time: 1min 43s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "params = choix.opt_pairwise(n_items, data[\"happiness\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parameters induce a ranking over the images.\n",
    "Images ranked at the bottom are consistently found to express *less* happiness, and vice-versa for images ranked at the top."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ranking = np.argsort(params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing the results\n",
    "\n",
    "The top three images that best express happiness are the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/MazO89pt5NljG/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/IBObAQW75gYne/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/12PIT4DOj6Tgek/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for i in ranking[::-1][:3]:\n",
    "    show_gif(int_to_idx[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The top three images that *least express happiness are the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/kEKXob9NSB1cY/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/2S2uEj3bvhIBi/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/13DlTJDBoNcrFS/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "for i in ranking[:3]:\n",
    "    show_gif(int_to_idx[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predicting future comparison outcomes\n",
    "\n",
    "Based on the model learnt from the data, it is also possible to predict what a user would select as \"better expressing happiness\" for any pair of images.\n",
    "Below is an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/sofgI5umz1Vjq/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<img src=\"http://media.giphy.com/media/zy0JIpL0niHG8/giphy.gif\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "rank = 2500\n",
    "top = ranking[::-1][rank]\n",
    "show_gif(int_to_idx[top])\n",
    "\n",
    "bottom = ranking[rank]\n",
    "show_gif(int_to_idx[bottom])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prob(user selects top image) = 0.80\n"
     ]
    }
   ],
   "source": [
    "prob_top_wins, _ = choix.probabilities((top, bottom), params)\n",
    "print(\"Prob(user selects top image) = {:.2f}\".format(prob_top_wins))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
